This competition dataset is a model-synthesized version of Cirrhosis Patient Survival Prediction (Dickson et al., 2023), and the original version (Fleming et al., 2013), also posted in Kaggle initially from the Mayo Clinic, collects 418 patients with primary biliary cirrhosis (PBC) of the liver carried out from 1974 to 1984. The competition dataset has 19 columns (6 categorical, 13 numerical) and 5,270 rows, compared to 418 in the original. The goal is to predict the patient’s cirrhosis status.
Competition link: https://www.kaggle.com/competitions/playground-series-s3e26
Original link: https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction/data
Numerical features:
| Feature | Description |
|---------------|-------------|
| N_Days | Day counts from registration to the day of record (dead, transplanted, or alive). |
| Age | Patient's age in days. |
| Bilirubin | Bilirubin level; high levels for a long period can indicate severe liver disease. |
| Cholesterol | Cholesterol level; high levels can progress steatosis into more severe stages. |
| Albumin | Albumin protein level in blood; low levels can indicate liver disease. |
| Copper | Copper level in serum/liver; deficiency relates to liver diseases. |
| Alk_Phos | High Alkaline phosphatase (ALP) enzyme level can indicate liver disorders. |
| SGOT | High Glutamic-oxaloacetic transaminase (SGOT) level could indicate liver damage. |
| Tryglicerides | Level of esters produced by the liver; high levels can indicate steatosis. |
| Platelets | Platelet count; often low in cirrhosis patients. |
| Prothrombin | Prothrombin time (PT); prolonged time indicates severe liver damage. |
Categorical features:
| Feature | Description |
|--------------|-------------|
| Sex | Patient's biological gender: Female or Male. |
| Drug | Medication type: D-penicillamine or Placebo. |
| Ascites | Presence of excess abdominal fluid. |
| Hepatomegaly | Presence of an enlarged liver. |
| Spiders | Presence of spider angiomas. |
| Edema | N (no edema), S (edema without diuretics), or Y (edema despite diuretic therapy). |
| Status | Patient's status on record day: C (alive), D (dead), CL (liver transplant & alive). |
| Stage | Disease stage: 1 (Steatosis), 2 (Fibrosis), 3 (Cirrhosis), 4 (Liver Failure). |
library(ggplot2)
library(dplyr)
library(corrplot)
# Check the proportion of the target class.
par(mfrow=c(1,2))
competition_status <- table(df_cirrhosis_competition_train$Status)
origin_status <- table(df_cirrhosis_origin$Status,exclude = NULL)
labels_competition <- paste(names(competition_status),
round(100 * competition_status / sum(competition_status), 1), "%")
labels_origin <- paste(names(origin_status),
round(100 * origin_status / sum(origin_status), 1), "%")
pie(competition_status, labels = labels_competition, main = "Competition Status")
pie(origin_status, labels = labels_origin, main = "Origin Status")
# Two datasets are unbalanced; we further check if they are duplicated.
df_cirrhosis_origin_no_id <- df_cirrhosis_origin %>% dplyr::select(-ID)
df_cirrhosis_competition_train_no_id <- df_cirrhosis_competition_train %>% dplyr::select(-id)
identical_rows_no_id <- intersect(df_cirrhosis_origin_no_id, df_cirrhosis_competition_train_no_id)
# nrow(identical_rows_no_id)
# Chkeking missing data
# any(is.na(df_cirrhosis_competition_train))
# any(is.na(df_cirrhosis_competition_test))
# any(is.na(df_cirrhosis_origin))
# The origin dataset has 12 columns that have missing values.
# To observe the outlier and distribution together, we use violin plot for numerical features
numerical_features<-c("N_Days","Age","Bilirubin","Cholesterol","Albumin","Copper","Alk_Phos","SGOT","Tryglicerides",
"Platelets","Prothrombin")
categorical_features<-c("Sex","Drug","Ascites","Hepatomegaly","Spiders","Edema","Stage")
# For categorical features, we use histogram to observe the difference amount of class
df_cirrhosis_competition_train$Source <- 'Competition Train'
df_cirrhosis_origin$Source <- 'Origin'
combined_df <- bind_rows(df_cirrhosis_competition_train, df_cirrhosis_origin)
# TO DO: Add thresholds
for (feature in numerical_features) {
p <- ggplot(combined_df, aes(x=Source, y=combined_df[[feature]], fill=Source)) +
geom_violin(trim=FALSE) +
labs(title=paste("Violin Plot of", feature, "by Source"),
x="Source",
y=feature) +
theme_minimal()
print(p)
}
for (feature in categorical_features) {
p2 <- ggplot(combined_df, aes(x=.data[[feature]], fill=Source)) +
geom_bar(position="dodge") +
labs(title=paste("Bar Plot of", feature, "by Source"),
x=feature,
y="Count") +
theme_minimal()
print(p2)
}
# Drop the Source col
df_cirrhosis_competition_train$Source <- NULL
df_cirrhosis_origin$Source <- NULL
# Draw correlation matrix for both dataset
df_cirrhosis_origin <- na.omit(df_cirrhosis_origin)
df_cirrhosis_competition_train$Status <- factor(df_cirrhosis_competition_train$Status, levels = c("C", "CL", "D"), labels = c(0, 1, 2))
df_cirrhosis_competition_train$Status <- as.numeric(as.character(df_cirrhosis_competition_train$Status))
df_cirrhosis_origin$Status <- factor(df_cirrhosis_origin$Status, levels = c("C", "CL", "D"), labels = c(0, 1, 2))
df_cirrhosis_origin$Status <- as.numeric(as.character(df_cirrhosis_origin$Status))
par(mfrow=c(1,1))
# Calculate the correlation matrix for the competition train dataset
cor_matrix_competition <- cor(df_cirrhosis_competition_train[,c(numerical_features,"Status")], use = "complete.obs")
corrplot(cor_matrix_competition, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 55)
# print(cor_matrix_competition)
# Calculate the correlation matrix for the origin dataset
cor_matrix_origin <- cor(df_cirrhosis_origin[,c(numerical_features,"Status")], use = "complete.obs")
corrplot(cor_matrix_origin, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 55)
# print(cor_matrix_origin)
# Conclusion: The proportion between competition and original is similar, which confirms their relation.
# Column with strong outliers: Bilirubin, Cholesterol, Copper, Alk_Phos,Tryglicerides
# Column with medium outliers: SGOT, Prothrombin, Platelets(original)
As the histogram, violin plot and correlation matrix shown:
The biological data(numerical features) reflects the medical reference; the outliers could indicate the patient’s severity or death. However, our target is classification; all patients are diagnosed with liver disease. If we can find a direct association between the bio feature and Status, we can use it as a severity marker, but we don’t find any.
The origin data is very similar to the competition one and does not duplicate the original data, so we might use it as a control group for reference purposes to validate whether the model-synthesized data is closed. Alternatively, we can merge those two to enhance the sample size of the minor class.
The target class is unbalanced and requires adjusting its portion before building the model by Oversampling CL and undersampling C.
Lastly, we will remove outliers since their occurrence is more of a quality issue than a nature indication.
Column with strong outliers: Bilirubin, Cholesterol, Copper, Alk_Phos, Tryglicerides
Column with medium outliers: SGOT, Prothrombin, Platelets(original)
# Remove outliers (numerical features)
# Apply Grubbs' Test to check my outlier observation
library(outliers)
library(missForest)
library(dplyr)
library(caret)
# Encoding (categorical features)
# Sex; F:0 M:1
df_cirrhosis_origin$Sex <- factor(df_cirrhosis_origin$Sex, levels = c("F", "M"), labels = c(0, 1))
df_cirrhosis_competition_train$Sex <- factor(df_cirrhosis_competition_train$Sex, levels = c("F", "M"), labels = c(0, 1))
# Drug; D-penicillamine:0 Placebo:1
df_cirrhosis_origin$Drug <- factor(df_cirrhosis_origin$Drug, levels = c("D-penicillamine", "Placebo"), labels = c(0, 1))
df_cirrhosis_competition_train$Drug <- factor(df_cirrhosis_competition_train$Drug, levels = c("D-penicillamine", "Placebo"), labels = c(0, 1))
# Ascites; N:0, Y:1
df_cirrhosis_origin$Ascites <- factor(df_cirrhosis_origin$Ascites, levels = c("N", "Y"), labels = c(0, 1))
df_cirrhosis_competition_train$Ascites <- factor(df_cirrhosis_competition_train$Ascites, levels = c("N", "Y"), labels = c(0, 1))
# Hepatomegaly; N:0, Y:1
df_cirrhosis_origin$Hepatomegaly <- factor(df_cirrhosis_origin$Hepatomegaly, levels = c("N", "Y"), labels = c(0, 1))
df_cirrhosis_competition_train$Hepatomegaly <- factor(df_cirrhosis_competition_train$Hepatomegaly, levels = c("N", "Y"), labels = c(0, 1))
# Spiders; N:0, Y:1
df_cirrhosis_origin$Spiders <- factor(df_cirrhosis_origin$Spiders, levels = c("N", "Y"), labels = c(0, 1))
df_cirrhosis_competition_train$Spiders <- factor(df_cirrhosis_competition_train$Spiders, levels = c("N", "Y"), labels = c(0, 1))
# Edema; N:0, S:1, Y:2 Label encoding
df_cirrhosis_origin$Edema <- factor(df_cirrhosis_origin$Edema, levels = c("N", "S", "Y"), labels = c(0, 1, 2))
df_cirrhosis_competition_train$Edema <- factor(df_cirrhosis_competition_train$Edema, levels = c("N", "S", "Y"), labels = c(0, 1, 2))
# Stage; 1 ,2, 3, 4 Label encoding
df_cirrhosis_origin$Stage <- factor(df_cirrhosis_origin$Stage, levels = c(1, 2, 3, 4))
df_cirrhosis_competition_train$Stage <- factor(df_cirrhosis_competition_train$Stage, levels = c(1, 2, 3, 4))
# Imputing orgin dataset
# For numeric columns like Cholesterol, Copper, Alk_Phos, SGOT, and Tryglicerides, the missing values are about 1/4 , which is a lot! If imputing mean or median might not reflect the original data. Hence, I will try to use Random Forest for imputing
na_na_numerical_cols <- c("Cholesterol","Copper","Alk_Phos","SGOT","Tryglicerides","Platelets","Prothrombin")
na_categorical_cols <- c("Drug","Ascites","Hepatomegaly","Spiders","Stage")
impute_missing_values <- function(df, features) {
imputed_data <- missForest(df[, features, drop = FALSE], maxiter = 30, ntree = 100)
df[, features] <- imputed_data$ximp
return(df)
}
# Numerical data imputing
df_cirrhosis_origin <- impute_missing_values(df_cirrhosis_origin,na_na_numerical_cols)
# Categorical data imputing
df_cirrhosis_origin <- impute_missing_values(df_cirrhosis_origin,na_categorical_cols)
# any(is.na(df_cirrhosis_origin))
# any(is.na(df_cirrhosis_competition_train))
# Outlier Test
perform_grubbs_test <- function(df, feature) {
test_result <- grubbs.test(df[[feature]], type = 10) # Two side
return(list(feature = feature, grubbs_result = test_result))
}
# Outlier Test Before
df_competition_train <- lapply(numerical_features, function(f) perform_grubbs_test(df_cirrhosis_competition_train, f))
df_origin <- lapply(numerical_features, function(f) perform_grubbs_test(df_cirrhosis_origin, f))
df_competition_train <- do.call(rbind, lapply(df_competition_train, function(x) data.frame(Feature = x$feature, p.value = x$grubbs_result$p.value)))
df_origin <- do.call(rbind, lapply(df_origin, function(x) data.frame(Feature = x$feature, p.value = x$grubbs_result$p.value)))
before_outlier <- merge(df_competition_train, df_origin, by = "Feature", suffixes = c("_Competition", "_Origin"))
print(before_outlier)
## Feature p.value_Competition p.value_Origin
## 1 Age 1.000000e+00 8.536309e-01
## 2 Albumin 1.742075e-02 1.350112e-02
## 3 Alk_Phos 9.371014e-07 1.057881e-06
## 4 Bilirubin 9.966206e-08 5.090156e-06
## 5 Cholesterol 1.114592e-09 8.666631e-08
## 6 Copper 1.152593e-07 1.879024e-06
## 7 N_Days 1.000000e+00 1.000000e+00
## 8 Platelets 1.000000e+00 1.520915e-01
## 9 Prothrombin 0.000000e+00 7.649354e-09
## 10 SGOT 7.960994e-09 1.805444e-07
## 11 Tryglicerides 0.000000e+00 3.309353e-12
remove_outliers <- function(df, features) {
for (feature in features) {
Q1 <- quantile(df[[feature]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[feature]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
df[[feature]] <- ifelse(df[[feature]] < lower_bound | df[[feature]] > upper_bound, NA, df[[feature]])
}
return(df)
}
origin_outlier_cols <-c("Bilirubin", "Cholesterol", "Copper", "Alk_Phos", "Tryglicerides", "SGOT", "Prothrombin", "Platelets")
competition_outlier_cols <- c("Bilirubin", "Cholesterol", "Copper", "Alk_Phos", "Tryglicerides", "SGOT", "Prothrombin")
# Remove Outliers and imputing by random forest
cleaned_origin_data <- remove_outliers(df_cirrhosis_origin, origin_outlier_cols)
cleaned_competition_data <- remove_outliers(df_cirrhosis_competition_train, competition_outlier_cols)
cleaned_origin_data <- impute_missing_values(df_cirrhosis_origin, origin_outlier_cols)
cleaned_competition_data <- impute_missing_values(df_cirrhosis_competition_train, competition_outlier_cols)
# Outlier Test After
cleaned_competition_train <- lapply(numerical_features, function(f) perform_grubbs_test(cleaned_competition_data, f))
cleaned_origin <- lapply(numerical_features, function(f) perform_grubbs_test(cleaned_origin_data, f))
cleaned_competition_train <- do.call(rbind, lapply(cleaned_competition_train, function(x) data.frame(Feature = x$feature, p.value = x$grubbs_result$p.value)))
cleaned_origin <- do.call(rbind, lapply(cleaned_origin, function(x) data.frame(Feature = x$feature, p.value = x$grubbs_result$p.value)))
after_outlier <- merge(cleaned_competition_train, cleaned_origin, by = "Feature", suffixes = c("_Competition", "_Origin"))
print(after_outlier)
## Feature p.value_Competition p.value_Origin
## 1 Age 1.000000e+00 8.536309e-01
## 2 Albumin 1.742075e-02 1.350112e-02
## 3 Alk_Phos 9.371014e-07 1.057881e-06
## 4 Bilirubin 9.966206e-08 5.090156e-06
## 5 Cholesterol 1.114592e-09 8.666631e-08
## 6 Copper 1.152593e-07 1.879024e-06
## 7 N_Days 1.000000e+00 1.000000e+00
## 8 Platelets 1.000000e+00 1.520915e-01
## 9 Prothrombin 0.000000e+00 7.649354e-09
## 10 SGOT 7.960994e-09 1.805444e-07
## 11 Tryglicerides 0.000000e+00 3.309353e-12
#Origin dataset small (<5000) suits for shapiro test
par(mfrow=c(1, 3))
for (feature in numerical_features) {
origin_data <- na.omit(cleaned_origin_data[[feature]][cleaned_origin_data[[feature]] > 0])
log_origin_data <- log(origin_data)
sqrt_orgin_data <- sqrt(origin_data)
hist(origin_data, main=paste("Origin", feature, "-No Change"), xlab="Values", col="lightgreen", breaks=30)
hist(log_origin_data, main=paste("Origin", feature, "- Log"), xlab="Values", col="lightblue", breaks=30)
hist(sqrt_orgin_data, main=paste("Origin", feature, "- Sqrt"), xlab="Values", col="lightblue", breaks=30)
shapiro_origin <- shapiro.test(origin_data)
log_shapiro_origin <- shapiro.test(log_origin_data)
sqrt_shapiro_origin <- shapiro.test(sqrt_orgin_data)
origin_pval <- format.pval(shapiro_origin$p.value, digits=3)
log_origin_pval <- format.pval(log_shapiro_origin$p.value, digits=3)
sqrt_origin_pval <- format.pval(sqrt_shapiro_origin$p.value, digits=3)
print(paste0("Origin", feature, "- p-value:", origin_pval))
print(paste0("Origin Log", feature, "- p-value:", log_origin_pval))
print(paste0("Origin Sqrt", feature, "- p-value:", log_origin_pval))
}
## [1] "OriginN_Days- p-value:1.98e-05"
## [1] "Origin LogN_Days- p-value:1.09e-14"
## [1] "Origin SqrtN_Days- p-value:1.09e-14"
## [1] "OriginAge- p-value:0.0769"
## [1] "Origin LogAge- p-value:0.0181"
## [1] "Origin SqrtAge- p-value:0.0181"
## [1] "OriginBilirubin- p-value:<2e-16"
## [1] "Origin LogBilirubin- p-value:1.19e-07"
## [1] "Origin SqrtBilirubin- p-value:1.19e-07"
## [1] "OriginCholesterol- p-value:<2e-16"
## [1] "Origin LogCholesterol- p-value:2.41e-10"
## [1] "Origin SqrtCholesterol- p-value:2.41e-10"
## [1] "OriginAlbumin- p-value:0.000356"
## [1] "Origin LogAlbumin- p-value:4.22e-09"
## [1] "Origin SqrtAlbumin- p-value:4.22e-09"
## [1] "OriginCopper- p-value:<2e-16"
## [1] "Origin LogCopper- p-value:0.217"
## [1] "Origin SqrtCopper- p-value:0.217"
## [1] "OriginAlk_Phos- p-value:<2e-16"
## [1] "Origin LogAlk_Phos- p-value:5.72e-08"
## [1] "Origin SqrtAlk_Phos- p-value:5.72e-08"
## [1] "OriginSGOT- p-value:9.45e-12"
## [1] "Origin LogSGOT- p-value:0.723"
## [1] "Origin SqrtSGOT- p-value:0.723"
## [1] "OriginTryglicerides- p-value:<2e-16"
## [1] "Origin LogTryglicerides- p-value:0.116"
## [1] "Origin SqrtTryglicerides- p-value:0.116"
## [1] "OriginPlatelets- p-value:0.028"
## [1] "Origin LogPlatelets- p-value:3.3e-06"
## [1] "Origin SqrtPlatelets- p-value:3.3e-06"
## [1] "OriginProthrombin- p-value:5.34e-14"
## [1] "Origin LogProthrombin- p-value:1.38e-10"
## [1] "Origin SqrtProthrombin- p-value:1.38e-10"
# Competition dataset slightly more significant than 5000, We can still use random sample or qqnorm to check its normality
for (feature in numerical_features) {
compet_data <- cleaned_competition_data[[feature]][cleaned_competition_data[[feature]] > 0]
log_compet_data<-log(compet_data)
sqrt_compet_data<-sqrt(compet_data)
# Q-Q plots
qqnorm(compet_data, main=paste("Original", feature))
qqline(compet_data, col = 2)
# Q-Q plots for log-transformed data
qqnorm(log_compet_data, main=paste("Log", feature))
qqline(log_compet_data, col = 2)
# Q-Q plots for sqrt-transformed data
qqnorm(sqrt_compet_data, main=paste("Sqrt", feature))
qqline(sqrt_compet_data, col = 2)
}
# Transformation (numerical features)
# Copy dataset for comparison
normaled_origin_data <- cleaned_origin_data
normaled_competition_data <- cleaned_competition_data
# Origin dataset -> N-day, Age , Platelets and Albumin don't need transform, Copper and SGOT use square root, the rest log
sqrt_origin_cols <- c("Copper","SGOT")
log_origin_cols <- c("Bilirubin","Cholesterol","Alk_Phos","Tryglicerides","Prothrombin","Copper")
for (col in sqrt_origin_cols) {
normaled_origin_data[[col]] <- sqrt(cleaned_origin_data[[col]])
}
for (col in log_origin_cols) {
normaled_origin_data[[col]] <- log(cleaned_origin_data[[col]]+1)
}
# Competition dataset -> Age , Prothrombin, Albumin don't need transform, N_Days, SGOT, Copper and Platelets use square root, the rest log
sqrt_compet_cols <- c("N_Days","Copper","Platelets")
log_compet_cols <- c("Bilirubin","Cholesterol","Alk_Phos","Tryglicerides")
for (col in sqrt_compet_cols) {
normaled_competition_data[[col]] <- sqrt(cleaned_competition_data[[col]])
}
for (col in log_compet_cols) {
normaled_competition_data[[col]] <- log(cleaned_competition_data[[col]]+1)
}
# Check normality test after transformation
par(mfrow=c(1, 2))
for (feature in numerical_features) {
print(paste("Feature:", feature))
# Shapiro-Wilk tests
original_shapiro <- shapiro.test(cleaned_origin_data[[feature]])
transformed_shapiro <- shapiro.test(normaled_origin_data[[feature]])
print(c("Original Data Shapiro-Wilk p-value:", original_shapiro$p.value))
print(c("Transformed Data Shapiro-Wilk p-value:", transformed_shapiro$p.value))
# Original Data
qqnorm(cleaned_competition_data[[feature]], main=paste("Original", feature))
qqline(cleaned_competition_data[[feature]], col = 2)
# Transformed Data
qqnorm(normaled_competition_data[[feature]], main=paste("Transformed", feature))
qqline(normaled_competition_data[[feature]], col = 2)
}
## [1] "Feature: N_Days"
## [1] "Original Data Shapiro-Wilk p-value:" "1.98468230839289e-05"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "1.98468230839289e-05"
## [1] "Feature: Age"
## [1] "Original Data Shapiro-Wilk p-value:" "0.0768507573419543"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.0768507573419543"
## [1] "Feature: Bilirubin"
## [1] "Original Data Shapiro-Wilk p-value:" "3.84319112594613e-24"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "1.29481754351789e-13"
## [1] "Feature: Cholesterol"
## [1] "Original Data Shapiro-Wilk p-value:" "1.66630109881766e-23"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "2.20281179356298e-10"
## [1] "Feature: Albumin"
## [1] "Original Data Shapiro-Wilk p-value:" "0.000355916743441901"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.000355916743441901"
## [1] "Feature: Copper"
## [1] "Original Data Shapiro-Wilk p-value:" "1.24346966243077e-18"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.394153989517887"
## [1] "Feature: Alk_Phos"
## [1] "Original Data Shapiro-Wilk p-value:" "2.96631338684554e-24"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "5.53477154949391e-08"
## [1] "Feature: SGOT"
## [1] "Original Data Shapiro-Wilk p-value:" "9.4479121752537e-12"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.000217080523730069"
## [1] "Feature: Tryglicerides"
## [1] "Original Data Shapiro-Wilk p-value:" "1.70759613289767e-17"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.0946767636988216"
## [1] "Feature: Platelets"
## [1] "Original Data Shapiro-Wilk p-value:" "0.0280479305265303"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "0.0280479305265303"
## [1] "Feature: Prothrombin"
## [1] "Original Data Shapiro-Wilk p-value:" "5.33810097432427e-14"
## [1] "Transformed Data Shapiro-Wilk p-value:"
## [2] "7.51169580162857e-11"
# After transformation, the numerical columns are improved.
# Normalization/Scaling (numerical features if needed)
z_score_standardize <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
normaled_origin_data[, numerical_features] <- lapply(normaled_origin_data[, numerical_features, drop = FALSE], z_score_standardize)
normaled_competition_data[, numerical_features] <- lapply(normaled_competition_data[, numerical_features, drop = FALSE], z_score_standardize)
# Check again before SMOTE and Split validation dataset
normaled_competition_data$Source <- 'Competition Train'
normaled_origin_data$Source <- 'Origin'
ready_df <- bind_rows(normaled_competition_data, normaled_origin_data)
for (feature in numerical_features) {
p <- ggplot(ready_df, aes(x=Source, y=ready_df[[feature]], fill=Source)) +
geom_violin(trim=FALSE) +
labs(title=paste("Violin Plot of", feature, "by Source"),
x="Source",
y=feature) +
theme_minimal()
print(p)
}
for (feature in categorical_features) {
p2 <- ggplot(ready_df, aes(x=.data[[feature]], fill=Source)) +
geom_bar(position="dodge") +
labs(title=paste("Bar Plot of", feature, "by Source"),
x=feature,
y="Count") +
theme_minimal()
print(p2)
}
# Drop the Source col
normaled_competition_data$Source <- NULL
normaled_origin_data$Source <- NULL
# any(is.na(normaled_origin_data))
# any(is.na(normaled_competition_data))
# As we can observe, the skewness and outliers are reduced after data processing, and no more missing values
After processing and cleaning the data, this reduced the skewness, outliers, and no more missing values. Theoretically, it’s not necessary to one-hot encode ordinal data to a random forest model; however, after trial and error, one-hot encoding is better in accuracy and reduces false negative samples. Furthermore, I use the SMOTE method to unbalance simple issues in the learning portion of each class. However, dividing proportions is a struggle for me. Is it better to divide it to 1:1:1 or reduce its imbalance by a degree? Finally, I chose undersampled to 4:1:2 by trying the “perc. under” parameter with the model evaluation result.
library(randomForest)
library(neuralnet)
library(nnet)
set.seed(123)
# Multiclass Logistic Regression -> Step
set.seed(1)
multinom_model <- multinom(Status ~ ., data = train_data)
## # weights: 63 (40 variable)
## initial value 3621.026103
## iter 10 value 2516.167260
## iter 20 value 2299.424692
## iter 30 value 2233.922803
## iter 40 value 2167.644349
## iter 50 value 2141.634856
## final value 2139.524194
## converged
# summary(multinom_model)
# Random Forest -> decision tree
set.seed(1)
rf_model <- randomForest(Status ~ ., data = train_data)
# print(rf_model)
# DNN
set.seed(1)
dnn_model <- neuralnet(Status ~ ., data = train_data)
# plot(dnn_model)
library(caret)
## Multi Class Prediction
predictions_lr <- predict(multinom_model, newdata = valid_data)
predictions_rf <- predict(rf_model, newdata = valid_data)
predictions_dnn <- predict(dnn_model, newdata = valid_data)
# Apply the max probability to 1
predicted_indices_dnn <- apply(predictions_dnn, 1, which.max)
# Convert the class one to its correspond label and back to factor
predicted_labels_dnn <- levels(train_data$Status)[predicted_indices_dnn]
predicted_labels_dnn <- factor(predicted_labels_dnn, levels = levels(train_data$Status))
# Evaluation
actual_labels <- factor(valid_data$Status, levels = levels(train_data$Status))
cm_lr <- confusionMatrix(factor(predictions_lr, levels = levels(train_data$Status)), actual_labels)
cm_rf <- confusionMatrix(factor(predictions_rf, levels = levels(train_data$Status)), actual_labels)
cm_dnn <- confusionMatrix(predicted_labels_dnn, actual_labels)
print(cm_lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 528 76 105
## 1 9 28 23
## 2 54 43 233
##
## Overall Statistics
##
## Accuracy : 0.7179
## 95% CI : (0.6903, 0.7444)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4845
##
## Mcnemar's Test P-Value : 3.233e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.8934 0.19048 0.6454
## Specificity 0.6437 0.96639 0.8686
## Pos Pred Value 0.7447 0.46667 0.7061
## Neg Pred Value 0.8385 0.88547 0.8336
## Prevalence 0.5378 0.13376 0.3285
## Detection Rate 0.4804 0.02548 0.2120
## Detection Prevalence 0.6451 0.05460 0.3003
## Balanced Accuracy 0.7686 0.57843 0.7570
print(cm_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 559 39 68
## 1 5 84 3
## 2 27 24 290
##
## Overall Statistics
##
## Accuracy : 0.849
## 95% CI : (0.8264, 0.8696)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7308
##
## Mcnemar's Test P-Value : 5.07e-13
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9459 0.57143 0.8033
## Specificity 0.7894 0.99160 0.9309
## Pos Pred Value 0.8393 0.91304 0.8504
## Neg Pred Value 0.9261 0.93744 0.9063
## Prevalence 0.5378 0.13376 0.3285
## Detection Rate 0.5086 0.07643 0.2639
## Detection Prevalence 0.6060 0.08371 0.3103
## Balanced Accuracy 0.8676 0.78151 0.8671
print(cm_dnn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 531 82 109
## 1 0 0 0
## 2 60 65 252
##
## Overall Statistics
##
## Accuracy : 0.7125
## 95% CI : (0.6847, 0.7391)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4616
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.8985 0.0000 0.6981
## Specificity 0.6240 1.0000 0.8306
## Pos Pred Value 0.7355 NaN 0.6684
## Neg Pred Value 0.8408 0.8662 0.8490
## Prevalence 0.5378 0.1338 0.3285
## Detection Rate 0.4832 0.0000 0.2293
## Detection Prevalence 0.6570 0.0000 0.3430
## Balanced Accuracy 0.7612 0.5000 0.7643
For survival analysis, we want our model to predict if the patient will be dead(C/CL vs. D) in N days; this can help the physician plan the interference strategy for that patient. In other words, if the patient is alive but the model predicts death (False Positive), this could cause unnecessary treatment; if the patient is dead but predicted alive(False Negative), this could miss the golden time to rescue their life and is irreversible for disease like Cirrhosis. Therefore, we want a high recall score and high accuracy. Besides, the sensitivity to differentiate deaths is also expected to be high. From the evaluation result, the random forest performs well but needs improvement, such as feature engineering or model fine-tuning.
| | Actual |
| Predict | DEAD | SURV | |
|--------:|:-------|----------|:--------:|
| DEAD | TP | FP |Precision |
| SURV | FN | TN | |
| | Recall | | |
library(ggplot2)
library(DALEX)
library(MASS)
## Multinom Logistic Regression
step_multinom_model <- stepAIC(multinom_model, direction = "backward")
summary(step_multinom_model)
full_model_aic <- AIC(multinom_model)
step_model_aic <- AIC(step_multinom_model)
full_model_feature <- length(coef(multinom_model)) - 1
step_model_feature <- length(coef(step_multinom_model)) - 1
model_comparison <- data.frame(
Model = c("Full", "Stepwise"),
AIC = c(full_model_aic, step_model_aic),
Features = c(full_model_feature, step_model_feature)
)
ggplot(data = model_comparison, aes(x = Model)) +
geom_bar(aes(y = AIC, fill = "AIC"), stat = "identity", position = position_dodge(), width = 0.4) +
geom_line(aes(y = Features * 100, group = 1, colour = "Features"), size = 1, linetype = "dashed") +
geom_point(aes(y = Features * 100, color = "Features"), size = 3) +
scale_y_continuous(name = "AIC Value", sec.axis = sec_axis(~./100, name = "Number of Features")) +
labs(title = "Model Comparison: AIC and Number of Features") +
scale_fill_manual(name = "", values = c("AIC" = "lightgreen")) +
scale_color_manual(name = "", values = c("Features" = "red")) +
theme_minimal()
multinom_selected_features<- step_multinom_model$coefnames
## Random Forest
rf_feature_importance <- importance(rf_model)
sorted_indices <- order(rf_feature_importance[, "MeanDecreaseGini"], decreasing = TRUE)
sorted_rf_feature_importance <- rf_feature_importance[sorted_indices, ]
print(sorted_rf_feature_importance)
rf_selected_features<-names(sorted_rf_feature_importance)[1:11]
print(rf_selected_features)
varImpPlot(rf_model)
## DNN
explainer_nn <- DALEX::explain(model = dnn_model,
data = train_data[, -which(names(train_data) == "Status")], # Exclude the response variable
y = as.numeric(train_data$Status), # Ensure this is numeric and properly corresponds to `Status`
label = "Neural Network")
explainer_nn %>% model_parts() %>% plot(show_boxplots = FALSE) + ggtitle("Feature Importance ", "")
feature_importance <- model_parts(explainer = explainer_nn, type = "raw")
print(feature_importance)
dnn_selected_features <- feature_importance$variable[2:11]
print(dnn_selected_features)
# Feature Selection and Spiting data again
# selected_cirrhosis_train<-normaled_cirrhosis_train2[,c("N_Days","Bilirubin","Prothrombin","Age","SGOT","Copper","Hepatomegaly.1","Status")]
#
# set.seed(234)
# train_indices2 <- sample(1:nrow(selected_cirrhosis_train), size = round(0.75 * nrow(selected_cirrhosis_train)))
# train_data2 <- selected_cirrhosis_train[train_indices2, ]
# valid_data2 <- selected_cirrhosis_train[-train_indices2, ]
After evaluating each model’s essential and significant features with different thresholds, surprisingly, the random forest did not have any categorical feature for its importance. In contrast, multi-class logistic regression and DNN has some encoded categorical columns.
After prioritize the importance in each model by each of the selection criteria, we got:
Logistic Regression: (Intercept), N_Days, Age, Bilirubin, Albumin, Copper, Alk_Phos, SGOT, Tryglicerides, Platelets, Prothrombin, Sex.1, Drug.1, Hepatomegaly.1, Spiders.1, Edema.1, Edema.2, Stage.4
Random Forest: Bilirubin, N_Days, Copper, Prothrombin, Age, SGOT, Alk_Phos, Platelets, Albumin, Cholesterol, Tryglicerides
DNN: Bilirubin, Copper, Prothrombin, N_Days, Hepatomegaly.1, Stage.4, SGOT, Edema.2, Spiders.1, Albumin
I was thinking of either using the selected feature based on its model or making intersection the essential feature for model evaluation and fine-tuning. If the second evaluation is not enhanced much, then there’s not much improvement that could be made in fine-tuning. Therefore, I choose to use selected features based on the model.
set.seed(234)
# Logistic Regression
set.seed(2)
multinom_model2 <- multinom(Status ~ ., data = train_data[,c("Status", multinom_selected_features[-1])]) # Remove intercept
## # weights: 57 (36 variable)
## initial value 3621.026103
## iter 10 value 2548.177201
## iter 20 value 2305.079316
## iter 30 value 2220.494425
## iter 40 value 2159.002567
## iter 50 value 2140.974363
## final value 2140.963096
## converged
# summary(multinom_model2)
# Random Forest -> decision tree
set.seed(2)
rf_model2 <- randomForest(Status ~ ., data = train_data[,c("Status", rf_selected_features)])
# DNN
set.seed(2)
dnn_model2 <- neuralnet(Status ~ ., data = train_data[,c("Status", dnn_selected_features)])
# plot(dnn_model2)
## Multi Class Prediction
predictions_lr2 <- predict(multinom_model2, newdata = valid_data)
predictions_rf2 <- predict(rf_model2, newdata = valid_data)
predictions_dnn2 <- predict(dnn_model2, newdata = valid_data)
# Apply the max probability to 1
predicted_indices_dnn2 <- apply(predictions_dnn2, 1, which.max)
predicted_labels_dnn2 <- levels(train_data$Status)[predicted_indices_dnn2]
predicted_labels_dnn2 <- factor(predicted_labels_dnn2, levels = levels(train_data$Status))
# Evaluation
actual_labels <- factor(valid_data$Status, levels = levels(train_data$Status))
cm_lr2 <- confusionMatrix(factor(predictions_lr2, levels = levels(train_data$Status)), actual_labels)
cm_rf2 <- confusionMatrix(factor(predictions_rf2, levels = levels(train_data$Status)), actual_labels)
cm_dnn2 <- confusionMatrix(predicted_labels_dnn2, actual_labels)
print(cm_lr2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 526 76 105
## 1 11 30 24
## 2 54 41 232
##
## Overall Statistics
##
## Accuracy : 0.717
## 95% CI : (0.6894, 0.7435)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.484
##
## Mcnemar's Test P-Value : 5.829e-15
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.8900 0.20408 0.6427
## Specificity 0.6437 0.96324 0.8713
## Pos Pred Value 0.7440 0.46154 0.7095
## Neg Pred Value 0.8342 0.88685 0.8329
## Prevalence 0.5378 0.13376 0.3285
## Detection Rate 0.4786 0.02730 0.2111
## Detection Prevalence 0.6433 0.05914 0.2975
## Balanced Accuracy 0.7669 0.58366 0.7570
print(cm_rf2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 558 44 66
## 1 6 79 3
## 2 27 24 292
##
## Overall Statistics
##
## Accuracy : 0.8453
## 95% CI : (0.8226, 0.8662)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7237
##
## Mcnemar's Test P-Value : 2.717e-13
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9442 0.53741 0.8089
## Specificity 0.7835 0.99055 0.9309
## Pos Pred Value 0.8353 0.89773 0.8513
## Neg Pred Value 0.9234 0.93274 0.9087
## Prevalence 0.5378 0.13376 0.3285
## Detection Rate 0.5077 0.07188 0.2657
## Detection Prevalence 0.6078 0.08007 0.3121
## Balanced Accuracy 0.8638 0.76398 0.8699
print(cm_dnn2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 534 78 113
## 1 0 0 0
## 2 57 69 248
##
## Overall Statistics
##
## Accuracy : 0.7116
## 95% CI : (0.6838, 0.7382)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4593
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9036 0.0000 0.6870
## Specificity 0.6240 1.0000 0.8293
## Pos Pred Value 0.7366 NaN 0.6631
## Neg Pred Value 0.8476 0.8662 0.8441
## Prevalence 0.5378 0.1338 0.3285
## Detection Rate 0.4859 0.0000 0.2257
## Detection Prevalence 0.6597 0.0000 0.3403
## Balanced Accuracy 0.7638 0.5000 0.7581
After feature selection, we generally reduce the dimension from 21 to 10 for random forest and NN model, with a limited influence on the accuracy. Besides, the False Negative sample was reduced in logistics and the NN model, which slightly improved the recall compared to the first evaluation. In addition, we checked with the AIC of the logistic regression model, and there is no change but with slight improvement for reducing the dimension.
At this point, we understand that the regression model and DNN are unsuitable for this dataset, and random forest might be the most suitable since it has high accuracy and the lowest case of false negatives among all three. So, I will continue with the fine-tuning process with Random Forest.
library(doParallel)
# Random Forest Fine-tuning
set.seed(345)
x <- train_data[,rf_selected_features]
y <- train_data$Status
mtry_values <- 1:ncol(x)
oob_errors <- numeric(length(mtry_values))
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
for (i in seq_along(mtry_values)) {
set.seed(3)
rf_temp <- randomForest(x, y, mtry = mtry_values[i], ntree = 1000, do.trace = FALSE)
oob_errors[i] <- rf_temp$err.rate[nrow(rf_temp$err.rate), "OOB"]
}
stopCluster(cl)
best_mtry_manual <- mtry_values[which.min(oob_errors)]
set.seed(3)
rf_model_tuned <- randomForest(x, y, mtry = best_mtry_manual, ntree = 120, importance = TRUE)
plot(mtry_values, oob_errors, type = "b", pch = 19, col = "blue",
xlab = "Number of Predictors (mtry)", ylab = "Out-of-Bag (OOB) Error Rate",
main = "OOB Error Rate Across mtry Values")
actual_labels <- factor(valid_data$Status, levels = levels(train_data$Status))
# Random Forest Final
predictions_rf3 <- predict(rf_model_tuned, newdata = valid_data)
cm_rf3 <- confusionMatrix(factor(predictions_rf3, levels = levels(train_data$Status)), actual_labels)
print(cm_rf3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 556 42 66
## 1 7 79 5
## 2 28 26 290
##
## Overall Statistics
##
## Accuracy : 0.8417
## 95% CI : (0.8187, 0.8628)
## No Information Rate : 0.5378
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7179
##
## Mcnemar's Test P-Value : 8.408e-12
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9408 0.53741 0.8033
## Specificity 0.7874 0.98739 0.9268
## Pos Pred Value 0.8373 0.86813 0.8430
## Neg Pred Value 0.9195 0.93254 0.9060
## Prevalence 0.5378 0.13376 0.3285
## Detection Rate 0.5059 0.07188 0.2639
## Detection Prevalence 0.6042 0.08280 0.3130
## Balanced Accuracy 0.8641 0.76240 0.8651
FN_improvement<-(cm_rf2$table[1,3]-cm_rf3$table[1,3])
During the process, I found that tuning the out-of-bag error is decisive for accuracy and false negative cases, and parallel computing is more efficient. As fine-tuning, the accuracy stays the same as the feature selected, but the False Negative reduced 0 ,which is essential for our evaluation criteria. In contrast, using more tree samples and parallel computing is unnecessary after the “mtry value” is decided. Besides, the sensitivity for class “D” is also enhanced. Now, we use the model for test data to predict its status.
selected_test_data<-test_data[,rf_selected_features]
# Aligned the
sqrt_cols <- c("N_Days", "Copper", "SGOT", "Platelets")
log_cols <- c("Bilirubin", "Cholesterol", "Alk_Phos", "Prothrombin")
# Apply sqrt and log transformations
for (col in sqrt_cols) {
selected_test_data[[col]] <- sqrt(selected_test_data[[col]])
}
for (col in log_cols) {
selected_test_data[[col]] <- log(selected_test_data[[col]] + 1)
}
selected_numerical_features <- numerical_features[!numerical_features %in% c("Tryglicerides")]
## Z-Score Normalization as train set
selected_test_data[, selected_numerical_features] <- lapply(selected_test_data[, selected_numerical_features, drop = FALSE], z_score_standardize)
test_data <- as.data.frame(selected_test_data)
# Test Set prediction
predictions_rf_Final <- predict(rf_model_tuned, newdata = test_data)
summary(predictions_rf_Final)
## 0 1 2
## 5008 324 2573
train_data_status <- data.frame(
Status = df_cirrhosis_competition_train$Status,
Type = "Train Data (Actural)"
)
predicted_test_status <- data.frame(
Status = as.character(predictions_rf_Final),
Type = "Test Data (Prediction)"
)
combined_data <- rbind(train_data_status, predicted_test_status)
ggplot(combined_data, aes(x = Status, fill = Status)) +
geom_bar(stat = "count", position = position_dodge()) +
facet_wrap(~ Type, ncol = 2) +
labs(fill = "Status",
x = "",
y = "Count",
title = "Status Proportion Comparison") +
theme_minimal() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5))
To follow best practices, I use a trained and valid dataset for model building, and the test dataset is used only after the model is trained. These modifications ensure that the test data’s preprocessing mirrors the training data’s preprocessing. We apply outlier removal or SMOTE, influencing the model evaluation’s data integrity and validity. We only use the test set for transformation and normalization.
Since the test data is actual, we have no correct label to validate our final prediction. However, since the train and test competition data are simulated by the gradient boosting learning model, we can compare the proportion of the two datasets to check if our model causes any unbalanced prediction. Luckily, those two data are not in the same proportion but in a similar arrangement, which could indicate our model learning strategy works.
Dickson,E., Grambsch,P., Fleming,T., Fisher,L., and Langworthy,A.. (2023). Cirrhosis Patient Survival Prediction. UCI Machine Learning Repository. https://doi.org/10.24432/C5R02G.
Fleming, Thomas R., and David P. Harrington. Counting processes and survival analysis. Vol. 625. John Wiley & Sons, 2013.